In [53]:
from __future__ import division
import sys
from time import time
from os.path import expanduser
sys.path.append(expanduser("~/Documents/ud120-projects-master/tools/"))
from email_preprocess import preprocess
import numpy as np
import pandas as pd


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()


no. of Chris training emails: 7936
no. of Sara training emails: 7884

Naive Bayes


In [2]:
import sklearn.naive_bayes as nb

My own exploration of the data


In [3]:
alabels_test=np.array(labels_test)

In [4]:
alabels_test.shape


Out[4]:
(1758,)

In [5]:
features_test.shape


Out[5]:
(1758, 3785)

This indicates that axis 0 is emails and axis 1 is words (the features?).


In [6]:
features_test[:10].sum(axis=1)


Out[6]:
array([ 4.1284468 ,  1.77616023,  3.40693072,  3.80264453,  1.        ,
        4.18466885,  3.60751176,  3.82152026,  5.67650369,  3.79537177])

In [7]:
features_test.sum()/features_test.shape[0]


Out[7]:
3.7190368141442387

I don't understand how this information is encoded. The obvious scheme would be an integer record of whether (or how often) each word is used in an email, but since most of the first ten rows sum to non-integer values, that doesn't seem to be the case; the values look more like tf-idf weights than raw counts. In any case, I can't work out the average length of the emails, which would indicate how confident one can be in identifying their writers.
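A quick way to poke at the encoding (my own guesses, not part of the assignment): look at the smallest distinct values and at how many entries are nonzero per email.

print(np.unique(features_test[:10])[:10])    # smallest distinct values; integers would suggest raw counts
print((features_test[:10] > 0).sum(axis=1))  # number of distinct words per email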

Q1: What is the accuracy?

In [8]:
%%time
model=nb.GaussianNB()
model.fit(features_train,labels_train)


CPU times: user 805 ms, sys: 566 ms, total: 1.37 s
Wall time: 1.38 s

In [9]:
%%time
testprediction=model.predict(features_test)


CPU times: user 117 ms, sys: 32.2 ms, total: 149 ms
Wall time: 148 ms

In [11]:
(testprediction==alabels_test).sum()/alabels_test.shape[0]


Out[11]:
0.97326507394766781

This is astoundingly good. I would never guess that people are so reliable in their choice of words.
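As an aside, sklearn can report this accuracy directly; a one-line equivalent using the model fitted above:

print(model.score(features_test, labels_test))  # built-in accuracy scorer, same result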

Q2: Which takes longer, training or prediction?

We see that training takes a good deal longer; not surprising at all.

SVM


In [12]:
from sklearn import svm

My own fooling around.

In [49]:
sub_features_train=features_train[:int(round(features_train.shape[0]/100))]
sub_labels_train=labels_train[:int(round(features_train.shape[0]/100))]

In [50]:
kernels=['linear', 'poly', 'rbf', 'sigmoid']

In [51]:
for i in kernels:
    print(i)
    model=svm.SVC(kernel=i)
    %time model.fit(sub_features_train,sub_labels_train)
    print((model.predict(features_test)==alabels_test).sum()/alabels_test.shape[0])


linear
CPU times: user 9.98 ms, sys: 12 µs, total: 10 ms
Wall time: 9.05 ms
0.854948805461
poly
CPU times: user 11.1 ms, sys: 0 ns, total: 11.1 ms
Wall time: 11.1 ms
0.522184300341
rbf
CPU times: user 13.7 ms, sys: 0 ns, total: 13.7 ms
Wall time: 13.7 ms
0.589874857793
sigmoid
CPU times: user 10.2 ms, sys: 0 ns, total: 10.2 ms
Wall time: 10.2 ms
0.492036405006

Wow. I did not think that the kernel mattered that much. I guess a highly complex kernel requires some tuning of its hyper-parameters? I really need to understand better how to tailor SVM kernels to the data, though that's pretty hard as long as this data is a black box.
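For instance, one could sweep the rbf kernel's gamma the same way; a quick sketch, with grid values that are arbitrary guesses on my part:

for g in [1e-4, 1e-2, 1.0]:  # arbitrary candidate values for gamma
    m = svm.SVC(kernel='rbf', gamma=g)
    m.fit(sub_features_train, sub_labels_train)
    print(g, (m.predict(features_test) == alabels_test).sum() / alabels_test.shape[0])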

Q1: What is the accuracy of the linear classifier?

In [13]:
model=svm.SVC(kernel='linear')

In [14]:
%%time
model.fit(features_train,labels_train)


CPU times: user 2min 31s, sys: 218 ms, total: 2min 31s
Wall time: 2min 31s
Out[14]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
%%time
testprediction=model.predict(features_test)


CPU times: user 15.9 s, sys: 3.88 ms, total: 15.9 s
Wall time: 15.9 s

In [16]:
np.sum(testprediction==alabels_test)/alabels_test.shape[0]


Out[16]:
0.98407281001137659
Q2: How do the training and prediction times compare to Naive Bayes?

In [17]:
#training time: the full-set SVM fit took 2min 31s, vs 805 ms for Naive Bayes
lsvm=(2*60+31)*1000
lsvm/805


Out[17]:
187.5776397515528

In [34]:
#prediction time: 15.9 s for the SVM vs 117 ms for Naive Bayes
15.9*1000/117


Out[34]:
135.89743589743588

Very poorly: training and prediction times are both well over 100 times longer for the SVM.

Q3: What is the accuracy after shrinking the training set to 1% of its original size?

In [19]:
sub_features_train=features_train[:int(round(features_train.shape[0]/100))]
sub_labels_train=labels_train[:int(round(features_train.shape[0]/100))]

In [20]:
model=svm.SVC(kernel='linear')
%time model.fit(sub_features_train,sub_labels_train)
%time testprediction=model.predict(features_test)
np.sum(testprediction==alabels_test)/alabels_test.shape[0]


CPU times: user 92.5 ms, sys: 5 µs, total: 92.5 ms
Wall time: 91.3 ms
CPU times: user 916 ms, sys: 1e+03 ns, total: 916 ms
Wall time: 916 ms
Out[20]:
0.88452787258248011

That's an awful lot faster, and it doesn't do too badly as far as prediction goes. The prediction time is also dramatically lower.

Q4: Which of these are applications where you can imagine a very quick-running algorithm is especially important?

Flagging credit card fraud (blocking a transaction before it goes through) and voice recognition (like Siri) would both require quick prediction times. However, training time for both of these can be long; there are very few applications where a long training time is unacceptable in the final product, though it can definitely make development and testing difficult.

Q5: What’s the accuracy with the more complex rbf kernel?

In [21]:
model=svm.SVC(kernel='rbf')
%time model.fit(sub_features_train,sub_labels_train)
%time testprediction=model.predict(features_test)
np.sum(testprediction==alabels_test)/alabels_test.shape[0]


CPU times: user 113 ms, sys: 6 µs, total: 113 ms
Wall time: 112 ms
CPU times: user 1.06 s, sys: 0 ns, total: 1.06 s
Wall time: 1.06 s
Out[21]:
0.61604095563139927

In [22]:
asub_labels_train=np.array(sub_labels_train)
np.sum(model.predict(sub_features_train)==asub_labels_train)/asub_labels_train.shape[0]


Out[22]:
0.67721518987341767

I would have guessed that this kernel might simply be overfitting the data, but that doesn't quite seem to be the case: it doesn't even predict the training data well. It must be underfitting, lacking the freedom to bend its decision boundary to match the data.

Q6 & Q7: What value of C gives the best accuracy, and what is it?

In [24]:
a=10**np.arange(1,10)
for i in a:
    model=svm.SVC(kernel='rbf',C=i)
    print('C='+str(i))
    print('fitting:')
    %time model.fit(sub_features_train,sub_labels_train)
    print('prediction:')
    %time testprediction=model.predict(features_test)
    print('accuracy='+str(np.sum(testprediction==alabels_test)/alabels_test.shape[0]))


C=10
fitting:
CPU times: user 105 ms, sys: 0 ns, total: 105 ms
Wall time: 103 ms
prediction:
CPU times: user 1.06 s, sys: 0 ns, total: 1.06 s
Wall time: 1.06 s
accuracy=0.616040955631
C=100
fitting:
CPU times: user 106 ms, sys: 0 ns, total: 106 ms
Wall time: 106 ms
prediction:
CPU times: user 1.15 s, sys: 0 ns, total: 1.15 s
Wall time: 1.15 s
accuracy=0.616040955631
C=1000
fitting:
CPU times: user 98.5 ms, sys: 0 ns, total: 98.5 ms
Wall time: 98.4 ms
prediction:
CPU times: user 1.04 s, sys: 0 ns, total: 1.04 s
Wall time: 1.04 s
accuracy=0.821387940842
C=10000
fitting:
CPU times: user 97.2 ms, sys: 0 ns, total: 97.2 ms
Wall time: 97.1 ms
prediction:
CPU times: user 851 ms, sys: 0 ns, total: 851 ms
Wall time: 851 ms
accuracy=0.892491467577
C=100000
fitting:
CPU times: user 94 ms, sys: 0 ns, total: 94 ms
Wall time: 94 ms
prediction:
CPU times: user 799 ms, sys: 0 ns, total: 799 ms
Wall time: 799 ms
accuracy=0.860068259386
C=1000000
fitting:
CPU times: user 90.2 ms, sys: 0 ns, total: 90.2 ms
Wall time: 90.1 ms
prediction:
CPU times: user 792 ms, sys: 0 ns, total: 792 ms
Wall time: 792 ms
accuracy=0.860068259386
C=10000000
fitting:
CPU times: user 93.7 ms, sys: 0 ns, total: 93.7 ms
Wall time: 93.8 ms
prediction:
CPU times: user 780 ms, sys: 24 µs, total: 780 ms
Wall time: 783 ms
accuracy=0.860068259386
C=100000000
fitting:
CPU times: user 89.4 ms, sys: 1 µs, total: 89.4 ms
Wall time: 92.1 ms
prediction:
CPU times: user 782 ms, sys: 13 µs, total: 782 ms
Wall time: 782 ms
accuracy=0.860068259386
C=1000000000
fitting:
CPU times: user 98.2 ms, sys: 0 ns, total: 98.2 ms
Wall time: 98.2 ms
prediction:
CPU times: user 797 ms, sys: 0 ns, total: 797 ms
Wall time: 797 ms
accuracy=0.860068259386

There seems to be an ideal value for C. The default value, 1, is insufficiently complex to follow the data and underfits; somewhere past C=10000 the model shifts toward overfitting, as the added flexibility lets the decision boundary chase individual points. The best value here is around C=10000.

However, I should note that this process is bad data science: we are overfitting the parameter C to the test data set. Udacity's suggestion doesn't even carry out this bad process properly; it has us stop at C=10000, so we don't even know whether we could have done better by going higher.
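A minimal sketch of the honest version (the split fraction and variable names are my own choices): carve a validation set out of the training data, tune C against that, and leave the test set untouched until the end.

from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases

X_tr, X_val, y_tr, y_val = train_test_split(
    sub_features_train, sub_labels_train, test_size=0.3, random_state=0)
for c in 10**np.arange(1, 10):
    val_acc = svm.SVC(kernel='rbf', C=c).fit(X_tr, y_tr).score(X_val, y_val)
    print('C=%d: validation accuracy %.3f' % (c, val_acc))

Below, I continue the flawed search anyway, on a finer grid around C=10000.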


In [23]:
a=np.linspace(10000-5000,10000+5000,num=10,dtype=int)
for i in a:
    model=svm.SVC(kernel='rbf',C=i)
    print('C='+str(i))
    print('fitting:')
    %time model.fit(sub_features_train,sub_labels_train)
    print('prediction:')
    %time testprediction=model.predict(features_test)
    print('accuracy='+str(np.sum(testprediction==alabels_test)/alabels_test.shape[0]))


C=5000
fitting:
CPU times: user 103 ms, sys: 25 µs, total: 103 ms
Wall time: 102 ms
prediction:
CPU times: user 851 ms, sys: 0 ns, total: 851 ms
Wall time: 851 ms
accuracy=0.899317406143
C=6111
fitting:
CPU times: user 93.9 ms, sys: 0 ns, total: 93.9 ms
Wall time: 93.9 ms
prediction:
CPU times: user 859 ms, sys: 0 ns, total: 859 ms
Wall time: 859 ms
accuracy=0.900455062571
C=7222
fitting:
CPU times: user 95.2 ms, sys: 0 ns, total: 95.2 ms
Wall time: 95.2 ms
prediction:
CPU times: user 877 ms, sys: 0 ns, total: 877 ms
Wall time: 877 ms
accuracy=0.896473265074
C=8333
fitting:
CPU times: user 93.9 ms, sys: 0 ns, total: 93.9 ms
Wall time: 93.9 ms
prediction:
CPU times: user 899 ms, sys: 0 ns, total: 899 ms
Wall time: 899 ms
accuracy=0.894197952218
C=9444
fitting:
CPU times: user 93.9 ms, sys: 0 ns, total: 93.9 ms
Wall time: 93.9 ms
prediction:
CPU times: user 858 ms, sys: 0 ns, total: 858 ms
Wall time: 858 ms
accuracy=0.892491467577
C=10555
fitting:
CPU times: user 96.2 ms, sys: 0 ns, total: 96.2 ms
Wall time: 96.2 ms
prediction:
CPU times: user 866 ms, sys: 0 ns, total: 866 ms
Wall time: 866 ms
accuracy=0.892491467577
C=11666
fitting:
CPU times: user 94.1 ms, sys: 0 ns, total: 94.1 ms
Wall time: 94.1 ms
prediction:
CPU times: user 842 ms, sys: 0 ns, total: 842 ms
Wall time: 842 ms
accuracy=0.889078498294
C=12777
fitting:
CPU times: user 94.1 ms, sys: 0 ns, total: 94.1 ms
Wall time: 94.1 ms
prediction:
CPU times: user 864 ms, sys: 0 ns, total: 864 ms
Wall time: 865 ms
accuracy=0.887940841866
C=13888
fitting:
CPU times: user 94.5 ms, sys: 0 ns, total: 94.5 ms
Wall time: 94.6 ms
prediction:
CPU times: user 842 ms, sys: 0 ns, total: 842 ms
Wall time: 842 ms
accuracy=0.885096700796
C=15000
fitting:
CPU times: user 103 ms, sys: 0 ns, total: 103 ms
Wall time: 103 ms
prediction:
CPU times: user 908 ms, sys: 0 ns, total: 908 ms
Wall time: 908 ms
accuracy=0.883390216155
Q8: What is the accuracy with the full set and "optimized" C?

In [25]:
model=svm.SVC(kernel='rbf',C=10000)

In [26]:
%%time
model.fit(features_train,labels_train)


CPU times: user 1min 39s, sys: 88 ms, total: 1min 39s
Wall time: 1min 39s
Out[26]:
SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

That took a while, but it was still an awful lot shorter than the 15 minutes it took with C=1.


In [30]:
%%time
testprediction=model.predict(features_test)


CPU times: user 10.3 s, sys: 20 µs, total: 10.3 s
Wall time: 10.3 s

In [28]:
labels_test=np.array(labels_test)

In [31]:
(np.sum(testprediction==labels_test))/labels_test.shape[0]


Out[31]:
0.99089874857792948

That's so good that I have a hard time believing it. Since I used the test set to choose C, it's probable that I have overfitted to the test set. To do this process properly, the training set should be divided into subsets, with one used for training and another for validation, and the accuracy maximized with respect to the parameter on the validation subset alone. Only then should the held-out test set be used to get a real estimate of the accuracy. The sklearn.grid_search.GridSearchCV class does this with cross-validation, a slightly more sophisticated version of what I just described.
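A sketch of what that would look like (the module path matches the old sklearn used here; in newer releases it is sklearn.model_selection, and the grid values are arbitrary):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5]}     # arbitrary candidate values
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=3)
search.fit(sub_features_train, sub_labels_train)  # C chosen by cross-validation alone
print(search.best_params_, search.best_score_)
print(search.score(features_test, labels_test))   # the test set is touched once, at the end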

Q9: What is the prediction for these examples?

In [32]:
print('elm 10 prediction = ' + str(testprediction[10])+', actual = '+ str(labels_test[10]))
print('elm 26 prediction = ' + str(testprediction[26])+', actual = '+ str(labels_test[26]))
print('elm 50 prediction = ' + str(testprediction[50])+', actual = '+ str(labels_test[50]))


elm 10 prediction = 1, actual = 1
elm 26 prediction = 0, actual = 0
elm 50 prediction = 1, actual = 1

In [33]:
np.sum(testprediction)


Out[33]:
877

Since Chris's emails are labelled 1, this says that 877 of the 1758 test emails are predicted to be Chris's.

Decision Trees


In [36]:
from sklearn.tree import DecisionTreeClassifier

Q1: What is the accuracy with the minimum sample split equal to 40?

In [37]:
model=DecisionTreeClassifier(min_samples_split=40)

In [38]:
%%time
model.fit(features_train,labels_train)


CPU times: user 57.8 s, sys: 76.1 ms, total: 57.8 s
Wall time: 57.8 s
Out[38]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=40, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [39]:
%%time
testprediction=model.predict(features_test)


CPU times: user 14.2 ms, sys: 8.01 ms, total: 22.2 ms
Wall time: 21 ms

In [40]:
labels_test=np.array(labels_test)

In [41]:
(testprediction==labels_test).sum()/labels_test.shape[0]


Out[41]:
0.97724687144482369
Q2: Speeding Up Via Feature Selection

In [44]:
features_train.shape[1]


Out[44]:
3785

The feature selection algorithm keeps only the features that are most strongly correlated with the labels, with the correlation in this case measured by a $\chi^2$ test between each feature and the labels. Here, we're keeping the top 10% most highly correlated features.
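A sketch of the kind of selector presumably at work inside email_preprocess (the score function and percentile here are my assumptions, not read from the course code):

from sklearn.feature_selection import SelectPercentile, chi2

selector = SelectPercentile(chi2, percentile=10)  # keep the top 10% of features
small_train = selector.fit_transform(features_train, labels_train)
print(small_train.shape)                          # roughly a tenth as many columns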

Q3: Smaller number of features

In [45]:
from email_preprocess import preprocesssmall
features_train, features_test, labels_train, labels_test = preprocesssmall()


no. of Chris training emails: 7936
no. of Sara training emails: 7884

In [46]:
features_train.shape[1]


Out[46]:
379

With a smaller number of features, we cannot have a more complex decision surface. If we added a completely random feature, however, an ideal machine learning algorithm would show no increase in complexity; in general, a good algorithm should only let more features increase the complexity of the decision surface when those features add useful information about the labels.

Since we are dropping features that are fairly highly correlated with the labels, this will decrease the complexity of the decision surface.

Q5: Accuracy with fewer features

In [48]:
%time model.fit(features_train,labels_train)
%time testprediction=model.predict(features_test)
labels_test=np.array(labels_test)
(testprediction==labels_test).sum()/labels_test.shape[0]


CPU times: user 3.78 s, sys: 14 µs, total: 3.78 s
Wall time: 3.78 s
CPU times: user 1.9 ms, sys: 0 ns, total: 1.9 ms
Wall time: 1.9 ms
Out[48]:
0.96643913538111492

Based on these two data points (hardly enough to draw firm conclusions), I'd guess that model fitting and prediction times scale at least linearly with the number of features, if not with some higher power.
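One way to test that guess (my own sketch; slicing columns is a crude stand-in for real feature selection, but fine for timing) is to refit the tree at several feature counts:

from time import time
for n in (50, 100, 200, 379):  # feature counts available in the small set
    t0 = time()
    model.fit(features_train[:, :n], labels_train)
    print('%d features: %.2f s' % (n, time() - t0))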